feat: Fine-Grained Authorization with Keycloak-Core Policy Engine#607
lakhansamani wants to merge 46 commits into
…methods Add 7 new collection/table names (Resource, Scope, Policy, PolicyTarget, Permission, PermissionScope, PermissionPolicy) to CollectionList and Collections var. Add 28 new method signatures to the Provider interface covering CRUD for resources, scopes, policies, policy targets, permissions, and join tables, plus the optimized GetPermissionsForResourceScope evaluation query.
Add CRUD methods for resources, scopes, policies, policy targets, permissions, permission scopes, and permission policies. Implement GetPermissionsForResourceScope optimized JOIN query for the evaluation engine. Add 7 new schemas to AutoMigrate.
Add CRUD methods for all 7 authorization collections using MongoDB driver. Implement GetPermissionsForResourceScope using sequential lookups. Create indexes for efficient name and foreign key lookups.
…ders Add authorization CRUD methods for all remaining NoSQL providers. Each provider follows its existing patterns for collection access, querying, and pagination. All 6 providers now implement the full 28-method authorization storage interface.
Add principal-agnostic policy evaluation engine with:
- Role-based and user-based policy evaluators (extensible to client/agent)
- Affirmative and unanimous decision strategies
- MaxScopes delegation ceiling enforcement
- Input validation (safe characters, known resource/scope checks)
- In-memory cache with negative caching and prefix invalidation
- Three enforcement modes (disabled, permissive, enforcing)
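A minimal sketch of how the two decision strategies and the MaxScopes ceiling could combine, assuming each policy yields grant, deny, or abstain. The names `Effect`, `Strategy`, `decide`, and `withinCeiling` are hypothetical and not taken from the branch, and treating an abstaining policy as "not a grant" under the unanimous strategy is an assumption.

```go
package main

import "fmt"

// Effect is a hypothetical per-policy outcome used only in this sketch.
type Effect int

const (
	Abstain Effect = iota // policy does not apply to the principal
	Grant
	Deny
)

// Strategy names follow the PR text; the type itself is an assumption.
type Strategy string

const (
	Affirmative Strategy = "affirmative"
	Unanimous   Strategy = "unanimous"
)

// Principal mirrors the fields named in the PR. A nil MaxScopes means
// "no delegation ceiling" in this sketch.
type Principal struct {
	ID        string
	Type      string
	Roles     []string
	MaxScopes []string
}

// withinCeiling reports whether scope is allowed by the principal's ceiling.
func withinCeiling(p Principal, scope string) bool {
	if p.MaxScopes == nil {
		return true
	}
	for _, s := range p.MaxScopes {
		if s == scope {
			return true
		}
	}
	return false
}

// decide combines per-policy effects. An explicit deny always wins, and an
// empty set of effects never grants under either strategy.
func decide(strategy Strategy, effects []Effect) bool {
	grants := 0
	for _, e := range effects {
		if e == Deny {
			return false
		}
		if e == Grant {
			grants++
		}
	}
	if strategy == Unanimous {
		return len(effects) > 0 && grants == len(effects)
	}
	return grants > 0
}

func main() {
	p := Principal{ID: "u1", Type: "user", Roles: []string{"editor"}, MaxScopes: []string{"read"}}
	fmt.Println(decide(Affirmative, []Effect{Grant, Abstain})) // true
	fmt.Println(decide(Unanimous, []Effect{Grant, Abstain}))   // false
	fmt.Println(withinCeiling(p, "write"))                     // false
}
```

In this sketch the ceiling check would run before policy evaluation, so a scope outside `MaxScopes` is denied no matter what the policies say.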
Add SetCache, GetCache, DeleteCacheByPrefix to Redis, DB-backed, and in-memory memory store providers. Update fakeMemoryStore in tests to implement the new interface methods.
Add --authorization-enforcement, --authorization-cache-ttl, --include-permissions-in-token, --authorization-log-all-checks flags. Wire authorization provider in cmd/root.go initialization order. Pass authorization provider to http_handlers and graphql dependencies.
Add 10 output types, 9 input types, 12 admin mutations, and 6 queries for the authorization model. Regenerate GraphQL code. Wire resolver stubs with placeholder implementations until Phase 10.
Add 18 GraphQL handler methods: CRUD for resources, scopes, policies, permissions, plus check_permission and my_permissions user-facing queries. Add AsAPI conversion methods to all authorization schemas. Wire resolvers to graphql Provider interface.
Add POST /api/v1/check-permission endpoint for downstream services.
Extracts principal from JWT Bearer token, evaluates permission via
authorization engine, returns {allowed, matched_policy} JSON response.
Add 16 integration tests covering resource/scope/policy/permission CRUD, permission evaluation with role-based policies, referential integrity checks on delete, and cleanup. All tests pass with SQLite.
Add 5 authorization pages to admin dashboard: Resources, Scopes, Policies, Permissions, and Evaluate. Include guided setup flow, natural language permission summaries, and policy evaluation test tool. Add Authorization nav item to sidebar.
- H-1: Make cache invalidation synchronous (remove goroutine) to prevent stale authorization decisions after policy changes
- H-2: Add allowlist validation for policy type, logic, and decision strategy in add/update handlers -- prevents silent permission escalation from typos
- C-2: Fix evaluateRoleTargets unanimous strategy returning true when no role targets exist (empty-target bypass)
- C-3: Fix DeletePolicy handler ordering -- storage provider now handles cascade deletion, preventing data loss on referential integrity failure
- M-3: Add name format validation (alphanumeric, hyphens, underscores, max 100 chars) in resource/scope/policy/permission add/update handlers
Upgrade to latest available versions to address CVE-2026-33816 and CVE-2026-33815 in pgx. Note: both CVEs have no upstream fix yet (Fixed in: N/A) but govulncheck confirms the affected symbols (Backend.Receive, FunctionCall.Decode, Bind.Decode) are not called by Authorizer code. Also upgrades gorm v1.25.5->v1.25.10.
… for unmatched checks
…led' with one-time log
…ermission-probe errors
…t cache, DoS caps
…ts, test coverage
…fer cache decode, shared identifier constant
…atePermission, duplicate detection
…tribution, dashboard tab paths
- cmd: emit one-time WARN when --authorization-enforcement is unrecognized (legacyTypoObserved was set but never read, so typos like "enforcin" silently demoted to permissive with no operator signal).
- evaluator: track first deny policy through the per-permission loop and return it as matched_policy on Allowed=false, so audits can attribute explicit denies instead of seeing a bare null.
- routes: NoRoute fallback serves the SPA shell for any unmatched GET under /dashboard/ or /app/, fixing 404 on refresh/bookmark of multi-segment SPA paths (e.g. /dashboard/authorization/resources).
- routes/handlers: send Cache-Control: no-cache on the SPA shell HTML and the unhashed entry assets (index.js, main.css), and immutable long-cache on the content-hashed chunks. Prevents post-deploy users from holding a cached shell that points at chunks the new build no longer publishes.
- dashboard Authorization.tsx: NavLinks now use absolute paths so clicking a tab from /dashboard/authorization/resources doesn't compound into /dashboard/authorization/resources/scopes (which the inner Routes can't match → tab pages rendered blank).
…leanup

TestSession was intermittently failing with "unauthorized" because subtests picked a session_token by iterating MemoryStoreProvider.GetAllData(). Session() rotates the cookie and deletes the old token in an async goroutine, so the map transiently held both old and new tokens, and map iteration randomness could land on the token about to be deleted by the rollover goroutine. Replaced the racy iteration with latestAppSessionCookie(ts), which reads the rotated cookie from the gin response writer's Set-Cookie headers (deterministic, race-free).

Also softened the storageProvider.Close() failure path in initTestSetup's t.Cleanup: fire-and-forget goroutines from RegisterEvent / LogEvent / AddSession can hold pool connections past test body completion, surfacing Close errors even though test logic succeeded. Switched t.Errorf to t.Logf so cleanup noise no longer fails the parent test (which previously manifested as bare "--- FAIL: TestX" with no failing subtest).

Verified with 8 consecutive `make test` runs (all pass) and `go test -count=50 -run TestSession` (50/50 pass).
Session() previously rotated the cookie+memory in two stages: a fire-and-forget goroutine deleted the OLD session/access/refresh trio, while SetUserSession for the new tokens ran synchronously. This left a window, bounded only by Go scheduler latency, where a stolen pre-rotation token was still accepted alongside the rotated one. It also made integration tests racy: any code that inspected MemoryStoreProvider state after Session() could see either set.

Move DeleteUserSession into the synchronous path, ordered AFTER the new session is fully established, so there is never a moment with no valid token and never a moment with both. Delete failure remains non-fatal (log and continue) since the new session is already live. The op is in-memory or a single Redis DEL, so the sync cost is negligible.

Also migrate profile_test.go and validate_session_test.go off the racy MemoryStoreProvider.GetAllData iteration pattern to latestAppSessionCookie. Those endpoints don't rotate today, but the pattern is fragile and these tests now share the same race-free helper as TestSession.

Includes a pre-existing enforceRequiredPermissions hook in Session() from the required_permissions feature on this branch.
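The ordering described in this commit can be sketched as follows: write the new session first, then delete the old one before returning, and treat a delete failure as non-fatal. The `sessionStore` interface, `memStore`, and `rotate` are hypothetical stand-ins for SetUserSession / DeleteUserSession, not the branch's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// sessionStore is a hypothetical stand-in for the memory store provider.
type sessionStore interface {
	Set(userID, token string) error
	Delete(userID, token string) error
}

type memStore struct{ tokens map[string]bool }

func (m *memStore) Set(userID, token string) error {
	m.tokens[userID+":"+token] = true
	return nil
}

func (m *memStore) Delete(userID, token string) error {
	key := userID + ":" + token
	if !m.tokens[key] {
		return errors.New("token not found")
	}
	delete(m.tokens, key)
	return nil
}

// rotate establishes the new session, then removes the old one in the same
// call (no goroutine), so callers never observe the old token afterwards.
func rotate(s sessionStore, userID, oldToken, newToken string) error {
	if err := s.Set(userID, newToken); err != nil {
		return err // old session untouched; nothing was removed
	}
	if err := s.Delete(userID, oldToken); err != nil {
		// Non-fatal: the new session is already live.
		log.Printf("session rotation: deleting old token failed: %v", err)
	}
	return nil
}

func main() {
	s := &memStore{tokens: map[string]bool{"u1:old": true}}
	err := rotate(s, "u1", "old", "new")
	fmt.Println(err, s.tokens["u1:old"], s.tokens["u1:new"]) // <nil> false true
}
```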
…esh stale REST-endpoint comments

Three follow-up gaps surfaced while auditing fine-grained-authorization observability:

1. Authz CRUD operations (permission/policy/resource/scope x add/update/delete) emitted no audit log entries. Login, signup, profile updates, webhook and email-template CRUD all audit; changes to who-can-do-what did not. Added 12 audit event constants and a matching AuditProvider.LogEvent call to each of the 12 resolvers, plus 4 new audit resource type constants (authz_permission/policy/resource/scope) so downstream consumers can filter on object type without parsing action names.
2. authorizer_authz_checks_total test coverage previously only exercised the unmatched_allowed label. Added TestCheckPermission_ResultLabels_IncrementCorrectCounter with subtests for allowed, denied, unmatched_denied, and error, each constructing the exact seed shape that lands on the target terminal path in CheckPermission and asserting a single increment on the matching counter series. Includes a seedResourceScopePermissionAllowingRole helper that mirrors the existing deny-policy seed.
3. Comments in evaluator.go and authorization_test.go referenced /api/v1/check-permission as the DoS-guard rationale; that endpoint was removed on this branch. Updated the comments to describe the actual surface (authenticated GraphQL myPermissions / required_permissions inputs).

All authz / metrics tests pass; full `make test` is green.
Backend already calls validatePolicyTargets from add_policy and update_policy (committed previously in 3d11699's audit), but the helper itself was never added to the tree. This commit lands it and its unit tests.

Rules enforced:
- target_type must equal the policy's type (role|user) so storage and evaluator agree on how to match.
- target_value is non-empty after trim.
- For role policies, target_value must be one of --roles so a typo cannot silently produce a dead policy that never matches. User targets are not looked up in the users table here: the lookup is per-target and races deletes; the evaluator no-ops on missing IDs.

Adds 6 table-driven test cases covering valid role / user / mismatched target_type / unknown role / empty value / empty targets.
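The three rules above might look roughly like this in Go. The `PolicyTarget` struct, the function signature, and the error wording are assumptions for illustration, and this sketch accepts an empty target list, which the real helper may or may not do.

```go
package main

import (
	"fmt"
	"strings"
)

// PolicyTarget is a hypothetical shape for a policy target row.
type PolicyTarget struct {
	TargetType  string
	TargetValue string
}

// validatePolicyTargets applies the rules from the commit message.
// knownRoles stands in for the configured --roles values.
// Assumption: an empty targets slice is accepted here.
func validatePolicyTargets(policyType string, targets []PolicyTarget, knownRoles []string) error {
	roles := make(map[string]bool, len(knownRoles))
	for _, r := range knownRoles {
		roles[r] = true
	}
	for i, t := range targets {
		// Rule 1: target_type must equal the policy's type.
		if t.TargetType != policyType {
			return fmt.Errorf("target %d: type %q does not match policy type %q", i, t.TargetType, policyType)
		}
		// Rule 2: target_value must be non-empty after trimming.
		v := strings.TrimSpace(t.TargetValue)
		if v == "" {
			return fmt.Errorf("target %d: empty target_value", i)
		}
		// Rule 3: role targets must name a configured role.
		if policyType == "role" && !roles[v] {
			return fmt.Errorf("target %d: unknown role %q", i, v)
		}
		// User IDs are deliberately not looked up here.
	}
	return nil
}

func main() {
	roles := []string{"admin", "editor"}
	fmt.Println(validatePolicyTargets("role", []PolicyTarget{{"role", "editor"}}, roles))       // <nil>
	fmt.Println(validatePolicyTargets("role", []PolicyTarget{{"role", "editr"}}, roles) != nil) // true
}
```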
… retire Evaluate tab

Three interlocking FGA UX changes that share schema regeneration.

1. Required-permissions field on session APIs
   New optional required_permissions: [PermissionInput!] on session, validate_session, validate_jwt_token. AND semantics: any deny or unmatched (resource, scope) returns "unauthorized". Helper lives in internal/graphql/permission_check.go; integration coverage spans the backward-compat (no field), granted, and denied paths for all three endpoints. validate_jwt_token also picks up a role-claim fallback so access tokens (claim "roles") work alongside id tokens (configured JWTRoleClaim, typically "role").
2. Drop the public check_permission surface
   The standalone check_permission GraphQL query and POST /api/v1/check-permission REST handler are removed. Required-permissions subsumes that workflow inside the authenticated session endpoints and avoids exposing the evaluator as an unauthenticated probe target. Schema types renamed accordingly: AuthzResourceScope → Permission, CheckPermissionInput → PermissionInput.
3. Retire the Evaluate dashboard tab
   The tab was the only consumer of check_permission. Removed the tab, the Evaluate.tsx page, the CheckPermissionQuery client, and the matching TS types. The /authorization/* catch-all now sends unknown subpaths back to Resources so a stale bookmark does not 404.

Polish:
- Dashboard FGA forms gained inline help text under name/description/type/decision-strategy/targets fields explaining the validation rules and the user-ID-not-email expectation.
- make dev now passes --authorization-enforcement=enforcing so the local developer loop matches production semantics.
- Stale "check_permission" wording in the cmd/root.go startup warn line swapped for "authorization checks".

Regenerated GraphQL bindings via make generate-graphql.
The embedded GraphQL playground loads React + GraphiQL bundles from cdn.jsdelivr.net at runtime. The global defaultCSP whitelists only self, unsafe-inline, and editor.unlayer.com — so the playground page rendered but every script and stylesheet from jsdelivr was CSP-blocked, leaving GraphiQL undefined and the page non-functional. Introduce a second policy constant, playgroundCSP, that swaps the unlayer allowlist for jsdelivr on script-src, style-src, and font-src and tightens connect-src to self (the playground only talks to /graphql, not api.unlayer.com). The middleware picks playgroundCSP when c.Request.URL.Path is exactly /playground and defaultCSP everywhere else, so the rest of the app retains its stricter posture.
Wire the endpoint label through enforceRequiredPermissions so each call site (session, validate_session, validate_jwt_token) emits authorizer_required_permissions_checks_total with a bounded endpoint label; add integration test asserting counters increment per outcome.
Session subtests each call login() internally (session rotates on every successful call), so the top-level accessToken captured before those subtests run is evicted from the memory store. The metrics subtest now calls login() + captureTokens() at its own start to get a fresh token.
…gaps

Three small comment tweaks from code review on commit 0c6e80c:
- enforceRequiredPermissions doc now states the not_requested path still emits a metric (callers see no error but observability is preserved), and a follow-on note explains the early-return invariant so a future refactor that collects all failures doesn't silently double-count.
- The denied subtest's `_, _ = ...` discard line now states the intent (error is expected, only the counter matters) and notes that the outcome=error path is not exercised here because integration tests cannot synthesize a CheckPermission storage fault without injection hooks the provider doesn't expose today.
Code-review fallout from Tasks 3+4: a residual permissive-mode test
asserted Allowed:true under a setup that now always denies, two tests had
"Permissive"/"Enforcing" qualifiers in their names that no longer mean
anything, and testSetupWithAuthzMode wrote to a config field the
authorization provider no longer reads.
- Deleted TestCheckPermission_PermissiveDefault_NoPermissions_Allows; its
premise ("permissive allows unmatched") is gone.
- Renamed TestCheckPermission_Enforcing_NoPermissions_Denies →
TestCheckPermission_NoPermissions_Denies and dropped the redundant
mode qualifier from its doc + failure message.
- Renamed TestCheckPermission_Permissive_WithExplicitDenyPolicy_StillDenies
→ TestCheckPermission_ExplicitDenyPolicy_Denies (the explicit-deny
invariant survives the dual-mode removal; only the name needed work).
- Replaced 9 remaining testSetupWithAuthzMode(..., Enforcing) call sites
with initTestSetup(t, getTestConfig()) and deleted the helper itself.
- Refreshed the validateResourceExists doc comment, which still narrated
the permissive-mode fall-through reason that no longer exists.
NormalizeAuthzEnforcement tests at lines 844-873 are left intact — Task 5
will delete the function and its tests together.
Delete the AuthorizationEnforcement config field, NormalizeAuthzEnforcement function, and all legacy permissive/disabled handling. The deprecated --authorization-enforcement CLI flag is kept as a no-op no-parse-error shim for one release, with a startup warning when the operator passes it. Startup probe wording updated to drop permissive-mode references. Deleted 5 NormalizeAuthzEnforcement unit tests from authorization_test.go.
After Task 5 removed the only Go-code consumers (NormalizeAuthzEnforcement, the CLI flag's defaulting logic, and the runRoot mode-switch), the two AuthorizationEnforcement* constants are orphaned. Drop both. Full integration suite still green.
…tric CHANGELOG.md: under [Unreleased] add a Breaking changes section noting the always-enforcing posture and the Prometheus label collapse, and an Added entry for authorizer_required_permissions_checks_total. MIGRATION.md: append a stand-alone "Authorization Enforcement Removal" section with the pre-upgrade audit (authz.unmatched signals to act on before flipping versions), the flag-removal action item, the dashboard relabel guidance, and a per-outcome table for the new counter. Calls out outcome=error as the only outcome that warrants paging.
Update: Authorization enforcement flag removed; new per-endpoint metric
Latest commits on this branch (
Why
The flag was a footgun: its default (
What changed
Test plan
Companion docs PR (separate repo)
The |
… entirely

The flag was added on this branch and never merged to main, so the deprecation shim and runRoot warning are dead weight: no operator out there has it in a systemd unit or docker-compose. Remove the cobra registration, the runRoot warning, and reframe the docs:
- CHANGELOG: rewrite the FGA-enforcement entry under 'Changed' as "always enforcing; the previously-proposed flag and dual modes were removed before shipping" instead of a deprecation breaking-change.
- MIGRATION: replace the "Authorization Enforcement Removal" section (which assumed users had permissive mode to migrate away from) with a quick-start "Fine-Grained Authorization — new in v2" that explains the model, the per-call adoption pattern, the observability counter, and the startup probe.

Behavior unchanged for end users: required_permissions checks against an undefined or denied (resource, scope) still return unauthorized; the new metric still emits per-endpoint outcomes.
Summary
Implements RFC #508 — Fine-Grained Authorization with a Keycloak-inspired four-pillar model (Resources, Scopes, Policies, Permissions). Replaces flat comma-separated role strings with a composable, principal-agnostic authorization engine that's always enforcing and opt-in per call.
How callers consume it
Three GraphQL operations gained an optional `required_permissions: [PermissionInput!]` field (AND semantics: any deny or unmatched `(resource, scope)` returns `unauthorized`):
- `session`
- `validate_session`
- `validate_jwt_token`

Pre-existing callers that omit the field see no behavior change.
Authenticated users can list their own grants via the `my_permissions` query. Admins manage the policy graph via `_add_resource` / `_add_scope` / `_add_policy` / `_add_permission` (plus list / update / delete for each).

Key features
- `affirmative` (any policy grants) and `unanimous` (all must grant) decision strategies; explicit deny always wins
- `Principal` abstraction with optional `MaxScopes` delegation ceiling
- Cache TTL set by `--authorization-cache-ttl` (default 300s, 0 to disable)
- Role targets validated against the `--roles` value at create/update time so a typo cannot silently produce a dead policy
- The `--authorization-enforcement` flag and its dual modes were dropped before shipping

CLI flags added
- `--authorization-cache-ttl` (int, default `300`): cache TTL in seconds; `0` disables.
- `--include-permissions-in-token` (bool, default `false`): embed grants in JWT access tokens.
- `--authorization-log-all-checks` (bool, default `false`): audit-log every check, not just denials.

No CLI flag controls enforcement; authorization is unconditionally enforcing.
Observability
Two Prometheus counter families plus a histogram:
- `authorizer_authz_checks_total` (counter), label `result` (`allowed|denied|unmatched|error`): incremented per `CheckPermission` call.
- `authorizer_authz_unmatched_total` (counter): checks with no match for `(resource, scope)`. Use to find policy-graph gaps.
- `authorizer_authz_check_duration_seconds` (histogram).
- `authorizer_required_permissions_checks_total` (counter), labels `endpoint` (`session|validate_session|validate_jwt_token`) and `outcome` (`granted|denied|not_requested|error`): `outcome="error"` should sit at zero.

Startup probe emits `authz: 0 permissions configured — all authorization checks will DENY` when the database has zero permissions, so operators don't lock themselves out silently.

Authz CRUD operations (add/update/delete for permission/policy/resource/scope) emit audit-log entries for compliance.
Storage
- Four entity tables (`authz_resources`, `authz_scopes`, `authz_policies`, `authz_permissions`) and 3 join tables, replicated across all 13 storage providers (PostgreSQL, MySQL, SQLite, SQL Server, YugabyteDB, MariaDB, PlanetScaleDB, CockroachDB, LibSQL, MongoDB, ArangoDB, Cassandra/ScyllaDB, Couchbase, DynamoDB).
- New methods on the `storage.Provider` interface: full CRUD plus the optimized `GetPermissionsForResourceScope` join used on the evaluator hot path.

Backward compatibility
- `User.Roles`, `IsSuperAdmin()`, JWT format: unchanged.
- `required_permissions` is per-call opt-in. Callers that omit it preserve pre-FGA semantics exactly.
- Grants are embedded in tokens only with `--include-permissions-in-token=true`.

Future-proof for M2M and AI agents
- `Principal{ID, Type, Roles, MaxScopes}` accepts `type: "user" | "client" | "agent"`. Adding a new principal type is one switch case, no schema migration.
- `MaxScopes` is a delegation ceiling: even if policies grant more, scopes outside the ceiling are denied.

Security guards
- DoS caps on `(resource, scope)` input reaching authenticated GraphQL.
- `validateResourceExists`: a transient DB blip surfaces as an error to the caller rather than as `Allowed: true`.
- Cache keys are built so that `(user, resource, scope)` results cannot collide.
- `UpdatePermission` with compensating rollback on failure mid-write.

Test plan
- `make test` green: every package passes against SQLite, including `internal/authorization`, `internal/graphql`, `internal/metrics`, `internal/storage`, and the full `internal/integration_tests` suite (~54s).
- `internal/storage` package: `TestStorageProvider` covers every storage method on SQLite (Resource / Scope / Policy / Permission CRUD, the `GetPermissionsForResourceScope` join, MFA Session, Audit Log, OAuth State, etc.).
- Authorization tests: `TestAuthorizationCRUD` (full CRUD on resources/scopes/policies/permissions + role-based grant check + admin-role deny check), `TestCheckPermission_NoPermissions_Denies`, `TestCheckPermission_ExplicitDenyPolicy_Denies`, `TestCheckPermission_ExplicitDenyOverridesAffirmativeGrant`, `TestCheckPermission_CacheKeyIncludesRoles`, `TestUpdatePermission_InvalidScopeDoesNotDropExistingLinks`, `TestAddPermission_DuplicateNameReturnsConflict`, `TestCheckPermission_IncrementsPrometheusCounters`, `TestCheckPermission_UnknownResource_DeniesAndDoesNotBumpUnmatchedCounter`, `TestCheckPermission_ResultLabels_IncrementCorrectCounter`, `TestCheckPermission_MaxScopes_*`, `TestCheckPermission_UnanimousDecisionStrategy_AllPoliciesMustAgree`, `TestCheckPermission_UserTypePolicy_MatchesOnPrincipalID`.
- `required_permissions` integration test: `TestRequiredPermissions` covers backward-compat + granted + denied paths for `session`, `validate_session`, `validate_jwt_token`, plus a `metrics counters increment per outcome` assertion that verifies the new counter increments on `granted`, `denied`, and `not_requested`.
- `TestValidatePolicyTargets` (6 cases: valid role / valid user / mismatched target_type / unknown role / empty value / empty targets).
- `go build ./...` clean.
- `go vet ./...` clean except a pre-existing mongodb context-leak hint unrelated to this branch.
- `npm run build` passes.
- `make test-all-db` (Docker, runs the same `TestStorageProvider` suite across postgres / sqlite / mongodb / arangodb / scylladb / dynamodb / couchbase). Pending.
- `make dev`, log into dashboard, define a resource → scope → policy → permission, then call `validate_jwt_token` with `required_permissions: [{resource, scope}]` and confirm granted / denied / unauthorized responses match policy intent.

Docs follow-up
Companion PR against `authorizer-docs` adds `core/authorization.md` (model reference, admin mutations, `required_permissions` usage, decision strategies, observability), updates `core/graphql-api.md` for the new field + `my_permissions` query, refreshes `core/metrics-monitoring.md` with the new counter, and adds an FGA section to `migration/v1-to-v2.md`. Plan written; PR opens after this lands.

Design specs
- `docs/superpowers/specs/2026-04-13-fine-grained-authorization-design.md`: original FGA design